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THE PHYLOGENETIC KANTOROVICH-RUBINSTEIN METRIC 
FOR ENVIRONMENTAL SEQUENCE SAMPLES 

STEVEN N. EVANS AND FREDERICK A. MATSEN 

Abstract. Using modern technology, it is now common to survey microbial 
communities by sequencing DNA or RNA extracted in bulk from a given envi- 
ronment. Comparative methods are needed that indicate the extent to which 
^- two communities differ given data sets of this type. UniFrac, a method built 

j^ around a somewhat ad hoc phylogenetics-based distance between two commu- 

^H nities, is one of the most commonly used tools for these analyses. We provide 

^H a foundation for such methods by establishing that if one equates a metage- 

_j_ nomic sample with its empirical distribution on a reference phylogcnetic tree, 

then the weighted UniFrac distance between two samples is just the classical 
I | Kantorovich-Rubinstein (KR) distance between the corresponding empirical 

distributions. We demonstrate that this KR distance and extensions of it that 
^ arise from incorporating uncertainty in the location of sample points can be 

written as a readily computable integral over the tree, we develop LP Zolotarev- 
O type generalizations of the metric, and we show how the p- value of the resulting 

natural permutation test of the null hypothesis "no difference between the two 

communities" can be approximated using a functional of a Gaussian process 

Qi^ indexed by the tree. We relate the L 2 case to an ANOVA-type decomposition 

I I and find that the distribution of its associated Gaussian functional is that of 

a computable linear combination of independent y? random variables. 
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Next-generation sequencing technology enables sequencing of hundreds of thou- 
sands to millions of short DNA sequences in a single experiment. This has led 
to a new methodology for characterizing the collection of microbes in a sample: 
rather than using observed morphology or the results of culturing experiments, it 
is possible to sequence directly genetic material extracted in bulk from the sample. 
This technology has revolutionized the possibilities for unbiased surveys of environ- 



X 



mental microbial diversity, ranging from the human gut (Gill et al. 20061 to acid 



mine drainages (Baker and Banfield 2003). We consider statistical comparison 



procedures for such DNA samples. 

We have divided this introductory section into several subsections. We begin 
in Subsection |1.1| by reviewing the definition of the UniFrac metrics that were de- 
veloped by microbial ecologists wishing to assign biologically meaningful distances 



2010 Mathematics Subject Classification. Primary: 92B10, 62P10; secondary: 62G10, 60B05, 
60G15. 

Key words and phrases. phylogenetics, metagenomics, Wasserstein metric, Monte Carlo, 
permutation test, randomization test, optimal transport, Gaussian process, reproducing kernel 
Hilbcrt space, barycenter, negative curvature, Hadamard space, CAT(k) space . 

The first author was supported in part by NSF grant DMS-0907630. The second author 
was supported by the Miller Institute for Basic Research in Science, University of California at 
Berkeley, by FHCRC startup funds, and NIH grant HG005966-01. 

1 



STEVEN N. EVANS AND FREDERICK A. MATSEN 



between two samples of the type described above. The metrics in the UniFrac 
papers are defined without preliminary justification via formulas. Although it has 
been pointed out that alternative ways of using phylogenetic branch lengths are pos- 



sible ( Faith et al. 2009 ), there has been little work investigating the extent to which 



there is a deeper mathematical rationale for these measures of similarity. With the 
goal of building a more mathematically founded comparative framework, we next 
observe in Subsection |1.2| that DNA from an environmental sample for a given locus 
can be thought of naturally as a probability distribution on a reference phylogenetic 



tree, and proceed to propose in Subsection 1.3 the Kantorovich-Rubinstein (KR) 
metric as a familiar and sensible way of comparing two such probability distribu- 
tions. Wc then observe in the same subsection how the KR metric can be computed 
via a simple integral over the tree, and that the resulting distance is in fact a gen- 
eralization of UniFrac. The final subsection of the introduction, Subsection |1.4| 
summarizes the other results of the paper. 

1.1. Introduction to UniFrac and its variants. In 2005, Lozupone and Knight 
introduced the UniFrac comparison measure to quantify the phylogenetic difference 



between microbial communities ( Lozupone and Knight 2005 1, and in 2007 they and 
others proposed a corresponding weighted version (Lozupone et al. 2007). These 



two papers already have hundreds of citations in total, attesting to their centrality 
in microbial community analysis. Researchers have used UniFrac to analyze micro- 



bial communities on the human hand (Fierer et al. 2008), establish the existence 



of a distinct gut microbial community associated with inflammatory bowel disease 



(Frank et al. 2007), and demonstrate that host genetics play a major part in de- 
termining intestinal microbiota (Rawls et al. 2006). The distance matrices derived 



from the UniFrac method are also commonly employed as input to clustering al- 



gorithms, including hierarchical clustering and UPGMA (Lozupone et al. 2007). 



Furthermore, the distances are widely used in conjunction with ordination methods 



such as PC A (Rintala et al. 20081 or to discover microbial community gradients 
with respect to another factor, such as ocean depth (Desnues et al. 2008). Two 



of the major metagenomic analysis "pipelines" developed in 2010 had a UniFrac 



analysis as one of their endpoints (Caporaso et al. 2010 Hartman et al. 2010). 



Recently, the software used to compute the two UniFrac distances has been re- 



optimized for speed (Hamady et al. 



heavily used mothur (Schloss et al 



2009) and it has been re- implemented in the 



2009 1 microbial analysis software package. 



The unweighted UniFrac distance uses only presence-absence data and is defined 
as follows. Imagine that one has two samples A and B of sequences. Call each such 
sequence a read. Build a phylogenetic tree on the total collection of reads. Color the 
tree according to the samples - if a given branch sits on a path between two reads 
from sample A, then it is colored red, if it sits on a path between two reads from 
sample B, then it is colored blue, and if both, then it is colored gray. Unweighted 
UniFrac is then the fraction of the total branch length that is "unique" to one of 
the samples: that is, it is the fraction of the total branch length that is either red 
or blue. 

Weighted UniFrac incorporates information about the frequencies of reads from 
the two samples by assigning weights to branch lengths that are not just or 1. 
Assume there are m reads from sample A and n reads in sample B, and that one 
builds a phylogenetic tree T from all m + n reads. For a given branch i of the tree 
T, let £j be the length of branch i and define fi to be the branch length fraction of 
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branch i, i.e. li divided by the total branch length of T. The formula for the (raw) 
weighted UniFrac distance between the two samples is 

(1) : ' A B * 



t=i 



where Ai and Bi are the respective number of descendants of branch i from com- 
munities A and B (Lozupone et al. 2007). In order to determine whether or not 



a read is a descendant of a branch, one needs to prescribe a vertex of the tree as 
being the root, but it turns out that different choices of the root lead to the same 
value of the distance because 
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and the quantity on the right only depends on the proportions of reads in each 
sample that are in the two disjoint subtrees obtained by deleting the branch i. 
Also, similar reasoning shows that the (unweighted) UniFrac distance is, up to a 
factor of |, given by a formula similar to (fTl) in which Ai/m (respectively, Bi/n) 
is replaced by a quantity that is either 1 or depending on whether there are any 
descendants of branch i in the A (respectively, B) sample and the branch length 
li is replaced by the branch length fraction /j. Using the quantities li rather than 
the fi simply changes the resulting distance by a multiplicative constant, the total 
branch length of the tree T. 

The UniFrac distances can also be calculated using a pre-existing tree (rather 
than one built from samples) by performing a sequence comparison such as BLAST 
to associate a read with a previously identified sequence and attaching the read 
to that sequence's leaf in the pre-existing tree with an intervening branch of zero 
length. Using this mapping strategy, the tree used for comparison can be adjusted 
depending on the purpose of the analysis. For example, the user may prefer an 
"ultrametric" tree (one with the same total branch length from the root to each 
tip) instead of one made with branch lengths that reflect amounts of molecular 
evolution. 

With the goal of making reported UniFrac values comparable across different 
trees, it is common to divide by a suitable scalar to fit them into the unit interval. 
Given a rooted tree T and counts Ai and Bi as above, the raw weighted UniFrac 
value is bounded above by 



(3) 



D 



i=\ 



A, 



B, 



where di is the distance from the root to the leaf side of edge i (Lozupone et al 



2007). When divided by this factor, the resulting scaled UniFrac values sit in 
the unit interval; a scaled UniFrac value of one means that there exists a branch 
adjacent to the root which can be cut to separate the two samples. Note that the 
factor D, and consequently the "normalized" weighted UniFrac value, does depend 
on the position of the root. 

A statistical significance for the observed UniFrac distance is typically assigned 
by a permutation procedure that we review here for the sake of completeness. The 



idea of a permutation test (also known as a randomization test) goes back to Fisher 
( |1935[ > and |Pitman| ( |1937a|b[ [1938] ) (see Good, 2005, and Edgington and Onghena, 
2007, for guides to the more recent literature). Suppose that our data are a pair of 
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samples with counts m and n, respectively, and that we have computed the UniFrac 
distance between the samples. Imagine creating a new pair of "samples" by taking 
some other subset of size m and its complement from the set of all m + n reads and 
then computing the distance between the two new samples. The proportion of the 
\ ) choices of such pairs of samples that result in a distance larger than the one 
observed in the data is an indication of the significance of the observed distance. Of 
course, we can rephrase this procedure as taking a uniform random subset of reads 
of size m and its complement (call such an object a random pair of pseudo-samples) 
and asking for the probability that the distance between these is greater than the 
observed one. Consequently, it is possible (and computationally necessary for large 
values of m and n) to approximate the proportion/probability in question by taking 
repeated independent choices of the random subset and recording the proportion 
of choices for which there is a distance between the pair of pseudo-samples greater 
than the observed one. We call the distribution of the distance between a random 
pair of pseudo-samples produced from a uniform random subset of reads of size 
m and its complement of size n the distribution of the distance under the null 
hypothesis of no clustering. 

1.2. Phylogenetic placement and probability distributions on a phyloge- 
netic tree. We now describe how it is natural to begin with a fixed reference 
phylogenetic tree constructed from previously-characterized DNA sequences and 
then use likelihood-based phylogenetic methods to map a DNA sample from some 
environment to a collection of phylogenetic placements on the reference tree. This 
collection of placements can then be thought of as a probability distribution on the 
reference tree. 

In classical likelihood-based phylogenetics (see, e.g., Felsenstein, 2004), one has 
data consisting of DNA sequences from a collection of taxa (e.g. species) and a 
probability model for that data. The probability model is composed of two ingre- 
dients. The first ingredient is a tree with branch lengths that has its leaves labeled 
by the taxa and describes their evolutionary relationship. The second ingredient 
is a Markovian stochastic mechanism for the evolution of DNA along the branches 
of the tree. The parameters of the model are the tree (its topology and branch 
lengths) and the rate parameters in the DNA evolution model. The likelihood of 
the data is, as usual, the function on the parameter space that gives the proba- 
bility of the observed data. The tree and rate parameters can be estimated using 
standard approaches such as maximum likelihood or Bayesian methods. 

Suppose one already has, from whatever source, DNA sequences for each of a 
number of taxa along with a corresponding phylogenetic tree and rate parameters, 
and that a new sequence, the query sequence, arrives. Rather than estimate a new 
tree and rate parameters ah initio, one can take the rate parameters as given and 
only consider trees that consist of the existing tree, the reference tree, augmented 
by a branch of some length leading from an attachment point on the reference tree 
to a leaf labeled by the new taxon. The relevant likelihood is now the conditional 
probability of the query sequence as a function of the attachment point and the 
pendant branch length, and one can input this likelihood into maximum likelihood 
or Bayesian methods to estimate these two parameters. For example, a maximum- 
likelihood point phylogenetic placement for a given query sequence is the maximum- 
likelihood estimate of the attachment point of the sequence to the tree and the 
pendant branch length leading to the sequence. Such estimates are produced by 
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a number of algorithms (Von Mering et al. 2007 Monier et al. 2008 Berger and 



Stamatakis 2010 Matsen et al. 2010). Typically, if there is more than one query 
sequence, then this procedure is applied in isolation to each one using the same 
reference tree; that is, the taxa corresponding to the successive query sequences 
aren't used to enlarge the reference tree. By fixing a reference tree rather than 
attempting to build a phylogenetic tree for the sample de novo, recent algorithms 
of this type are able to place tens of thousands of query sequences per hour per 
processor on a reference tree of one thousand taxa, with linear performance scaling 
in the number of reference taxa. 

For the purposes of this paper, the data we retain from a collection of point 
phylogenetic placements will simply be the attachment locations of those place- 
ments on the reference phylogenetic tree. We will call these positions placement 
locations. We can identify such a set of placement locations with its empirical dis- 
tribution, that is, with the probability distribution that places an equal mass at 
each placement. In this way, starting with a reference tree and an aligned collection 
of reads, we arrive at a probability distribution on the reference tree representing 
the distribution of those reads across the tree. 

One can also adopt a Bayesian perspective and assume a prior probability on 
the branch to which the attachment is made, the attachment location within that 
branch, and the pendant branch length, in order to calculate a posterior probability 
distribution for a placement. For example, one might take a prior for the attachment 
location and pendant branch length that assumes these quantities are independent, 
with the prior distribution for the attachment location being uniform over branches 
and uniform within each branch and with the prior distribution of the pendant 
branch length being exponential or uniform over some range. By integrating out 
the pendant branch length, one obtains a posterior probability distribution fii on 
the tree for query sequence i. We call such a probability distribution a spread 
placement: with priors such as those above, /i^ will have a density with respect to 
the natural length measure on the tree. It is natural to associate this collection 
of probability distributions with the single distribution ^2 i ^i/n, where n is the 
number of query sequences. 

For large data sets, it is not practical to record detailed information about the 
posterior probability distribution. Thus, in the implementation of |Matsen et al.| 
(2010), the posterior probability is computed on a branch- by-branch basis for a 
given query sequence by integrating out the attachment location and the pendant 
branch length, resulting in a probability for each branch. The mass is then assigned 
to the attachment location of the maximum likelihood phylogenetic placement. 
With this simplification, we are back in the point placement situation in which 
each query sequence is assigned to a single point on the reference tree and the 
collection of assignments is described by the empirical distribution of this set of 
points. However, since it is possible in principle to work with a representation of a 
sample that is not just a discrete distribution with equal masses on each point, we 
develop the theory in this greater level of generality. 

1.3. Comparing probability distributions on a phylogenetic tree. If one 

wished to use the standard Neyman-Pearson framework for statistical inference 
to determine whether two metagenomic samples came from communities with the 
"same" or "different" constituents, one would first have to propose a family of 
probability distributions that described the outcomes of sampling from a range 



6 STEVEN N. EVANS AND FREDERICK A. MATSEN 

of communities and then construct a test of the hypothesis that the two samples 
were realizations from the same distribution in the family. However, there does not 
appear to be such a family of distributions that is appropriate in this setting. 
We are thus led to the idea of representing the two samples as probability distri- 



butions on the reference tree in the manner described in Subsection 1.2 above, cal- 
culating a suitable distance between these two probability distributions, and using 
the general permutation/randomization test approach reviewed in Subsection |1.1| 
above to assign a statistical significance to the observed distance. 

The key element in implementing this proposal is the choice of a suitable metric 
on the space of probability distributions on the reference tree. There are, of course, 



a multitude of choices: Chapter 6 of Villani (2009) notes that there arc "dozens 



and dozens" of them and provides a discussion of their similarities, differences and 
various virtues. 

Perhaps the most familiar metric is the total variation distance, which is just the 
supremum over all (Borel) sets of the difference between the masses assigned to the 
set by the two distributions. The total variation distance is clearly inappropriate 
for our purposes, however, because it pays no attention to the evolutionary distance 
structure on the tree: if one took k point placements and constructed another set of 
placements by perturbing each of the original placements by a tiny amount so that 
the two sets of placements were disjoint, then the total variation distance between 
the corresponding probability distributions would be 1, the largest it can be for 
any pair of probability distributions, even though we would regard the two sets of 
placements as being very close. Note that even genetic material from organisms of 
the same species can result in slightly different placements due to genetic variation 
within species and experimental error. 

We therefore need a metric that is compatible with the evolutionary distance on 
the reference tree and measures two distributions as being close if one is obtained 
from the other by short range redistributions of mass. The Kantorovich-Rubinstein 
(KR) metric, which can be defined for probability distributions on an arbitrary 
metric space, is a classical and widely used distance that meets this requirement and, 
as we shall see, has other desirable properties such as being easily computable on a 
tree. It is defined rigorously in Section [2] below, but it can be described intuitively 
in physical terms as follows. Picture each of two probability distributions on a 
metric space as a collection of piles of sand with unit total mass: the mass of sand 
in the pile at a given point is equal to the probability mass at that point. Suppose 
that the amount of "work" required to transport an amount of sand from one 
place to another is proportional to the mass of the sand moved times the distance 
it has to travel. Then, the KR distance between two probability distributions 
P and Q is simply the minimum amount of work required to move sand in the 
configuration corresponding to P into the configuration corresponding to Q. It will 
require little effort to move sand between the configurations corresponding to two 
similar probability distributions, while more will be needed for two distributions 
that place most of their respective masses on disjoint regions of the metric space. As 



noted by Villani ( 2009 ) , the KR metric is also called the Wasserstein(l) metric or, in 
the engineering literature, the earth mover's distance. We note that mass-transport 
ideas have already been used in evolutionary bioinformatics for the comparison 
and clustering of "evolutionary fingerprints"- such a fingerprint being defined by 
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Kosakovsky Pond et al. (2010) as a discrete bivariate distribution on synonymous 



and nonsynonymous mutation rates for a given gene. 

1.4. Overview of results. Our first result is that in the phylogenetic case, the 
optimization implicit in the definition of the KR metric can be done analytically, 
resulting in a closed form expression that can be evaluated in linear time, thereby 
enabling analysis of the volume of data produced by large-scale sequencing studies. 
Indeed, as shown in Section [2] the metric can be represented as a single integral 
over the tree, and for point placements the integral reduces to a summation with 
a number of terms on the order of the number of placements. In contrast, com- 
puting the KR metric in Euclidean spaces of dimension greater than one requires 
a linear programming optimization step. It is remarkable that the point version of 
this closed-form expression for the phylogenetic KR distance (although apparently 
not the optimal mass transport justification for the distance) was intuited by mi- 
crobial ecologists and is nothing other than the weighted UniFrac distance recalled 
in Subsection 11.11 above. 

We introduce LP generalizations of the KR metric that are analogous to ones 



on the real line due to Zolotarev (Rachev 1991 Rachev and Riischendorf 1998) 



- the KR metric corresponds to the case p = 1. Small p emphasizes primarily 
differences due to separation of samples across the tree, while large p emphasizes 
large mass differences. The generalizations do not arise from optimal mass transport 
considerations, but we remark in Section |5.3| that the square of the p — 2 version 
does have an appealing ANOVA-like interpretation as the amount of variability in 
a pooling of the two samples that is not accounted for by the variability in each of 
them. 

We show in Section [3] that the distribution of the distance under the null hy- 
pothesis of no clustering is approximately that of a readily-computable functional 
of a Gaussian process indexed by the tree and that this Gaussian process is rela- 
tively simple to simulate. Moreover, we observe that when p — 2 this approximate 
distribution is that of the square root of a weighted sum of \i random variables. 
We also discuss the interpretation of the resulting p value when the data exhibit 
local "clumps" that might be viewed as being the objects of fundamental biological 
interest rather than the individual reads. 

In Section [51 we discuss alternate approaches to sample comparison. In par- 
ticular, we remark that any probability distribution on a tree has a well-defined 
barycenter (that is, center of mass) that can be computed effectively. Thus, one 
can obtain a one point summary of the location of a sample by considering the 
barycenter of the associated probability distribution and measure the similarity of 
two samples by computing the distance between the corresponding barycenters. 

2. The phylogenetic Kantorovich-Rubinstein metric 

In this section we more formally describe the phylogenetic Kantorovich-Rubinstein 
metric, which is a particular case of the family of Wasserstein metrics. We then 
use a dual formulation of the KR metric to show that it can be calculated in linear 
time via a simple integral over the tree. We also introduce a Zolotarev-type LP 
generalization. 

Let T be a tree with branch lengths. Write d for the path distance on T. We 
assume that probability distributions have been given on the tree via collections of 
either "point" or "spread" placements as described in the introduction. 
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For a metric space (S,r), the Kantorovich- Rubinstein distance Z(P,Q) between 
two Borel probability distributions P and Q on S is defined as follows. Let 1Z(P, Q) 
denote the set of probability distributions R on the product space S x S with the 
property R(A x S) = P{A) and R{S x B) = Q(B) for all Borel sets A and B (that 
is, the two marginal distributions of R are P and Q). Then, 

(4) Z(P, Q) := inf if r(x, y) R(dx, dy) : R G K(P, Q) } ; 

UsxS ) 



Ambrosio et al. 2008 Villani. 20091 



see, for example, (Rachev 1991 Rachev and Riischendorf 1998 Villani 2003 



There is an alternative formula for Z(P, Q) that comes from convex duality. 
Write C for the set of functions / : S —$■ K with the Lipschitz property \f(x) — 
f{y)\ < r ( x >y) f° r an x,y <E S. Then, 

Z(P, Q) = sup U f(x) P(dx) - J f(y) Q(dy) : f € C 

We can use this expression to get a simple explicit formula for Z(P, Q) when 
(S,r) = (T,d). 

Given any two points x,y € T, let [x,y] be the arc between them. There is a 
unique Borel measure A on T such that A([a;, y}) = d(x, y) for all x, y G T. We call 
A the length measure; it is analogous to Lebesgue measure on the real line. Fix a 
distinguished point p e T, which we call the "root" of the tree. For any / e C 
with f(p) = 0, there is an A-a.e. unique Borel function g : T — > [— 1, 1] such that 
f(x) — J, , g(y) \{dy) (this follows easily from the analogous fact for the real line). 

Given x G T, put t(x) :— {y G T : x G [p, y]}; that is, if we draw the tree with 
the root p at the top of the page, then r{x) is the sub-tree below x. Observe that 
if h : T —} K is a bounded Borel function and p, is a Borel probability distribution 
on T, then we have the integration-by-parts formula 

h(y)X(dy)\ p{dx)= I l[ p , x ](y)h(y)(jj,®\)(dx,dy) 
It \J[p,x] J Jtxt 

l T(y) (x)h(y) (p <g> X)(dx, dy) 

TxT 

h(y) I f p(dx)) X(dy) 

I \Jr(y) J 

h{y)p(T{y))\(dy). 

T 

Thus, if P and Q are two Borel probability distributions on T and / : T — > R is 
given by f(x) = J [px] g{y) X(dy), then we have 




f(x)P(dx)= / / g(y)X(dy)\ P(dx) = / g(y)P(r(y)) X(dy), 

T JT \J[p,x] ) JT 

and an analogous formula holds for Q. Hence, 

Z(P, Q) = sup I J g(y) [P(r(y)) - Q(r(y))] X(dy) :-l<g<+l 
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It is clear that the integral is maximized by taking g(y) = +1 (resp. g(y) = —1) 
when P{t{ v )) > Q{r{y)) (resp. P(r(y)) < Q(r(y))), so that 



(5) 



Z(P,Q) = I \P{r{y))-Q{r{y))\ \{dy). 
Jt 



Note that (fTl) is the special case of ^ that arises when P assigns point mass 
1/m to each of the leaves in community A, and Q assigns point mass 1/n to each 
of the leaves in community B. 

We can generalize the definition of the Kantorovich-Rubinstein distance by tak- 
ing any pseudo-metric / on [0, 1] and setting 



Z f (P,Q) 



f(P(T(y)),Q(r(y)))X(dy). 



This object will be a pseudo-metric on the space of probability distributions on the 
tree T. All of the distances considered so far are of the form Zj for an appropriate 
choice of the pseudo-metric /: (unweighted) UniFrac results from taking f(x,y) 
equal to one when exactly one of x or y is greater than zero, and Z arises when 
f(x,y) = \x-y\. 

Furthermore, if f(x,y) — /(l — x, 1 — y), then Zf is invariant with respect to 
the position of the root. Indeed, for A-a.e. y £ T we have that y is in the interior 
of a branch and P({y}) = Q({y}) = so that, for such y, P(r{y)) and P(T\r(y)) 
(respectively, Q(r(y)) and Q(T\r(y))) are the P-masses (respectively, Q-masses) 
of the two disjoint connected components of T produced by removing y (cf. pi), 
and hence these quantities don't depend on the choice of the root. Because 



f(P(r(y)), Q{r(y))) = - [f(P(r( y )),Q(r(y))) + /(l - P(r(y)), 1 - Q(r(y)))} 
- ~ [f(P(r(y)),Q(r(y))) + f(P(T \ r(y)),Q(T \ r(y)))} 

for any y £ T, the claimed invariance follows upon integrating with respect to A. 
In particular, we see that the distance Z is invariant to the position of the root, a 
fact that is already apparent from the original definition Q. 

In a similar spirit, the KR distance as defined by the integral |5| can be gener- 
alized to an L p Zolotarev-type version by setting 



ZJP,Q) 



\P(r(y))-Q(r(y))\ p X(dy) 



for < p < oo - see (Rachev 1991 Rachev and Riischendorf 1998) for a discussion 



of analogous metrics for probability distributions on the real line. Intuitively, large 
p gives more weight in the distance to parts of the tree which are maximally different 
in terms of P and Q, while small p gives more weight to differences which require lots 
of transport. The position of the root p also does not matter for this generalization 
of Z by the argument above. 
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As the following computations show, the distance Z2 has a particularly appealing 
interpretation. First note that 



Zl(P,Q) 



\P(t(u)) - Q(t(u))\ 2 X(du) 



= / P{t{u)) 2 \{du) - 2 / P(t(u))Q(t(u)) \{du) 
Jt Jt 



Q{t{u)) 2 \{du) 



t Jt 



![/>,«] ( w )![p,u.] («) P(dv) P(dw) 



A(d«) 



t Jt 



![/>,«] {u)l\p, w ] («) ^(dv) Q(dw) 



A(du) 



T JT 



l[ P ,t)] (w)l[p,i«] ( u ) Q(du) Q(diu) 



A(d«) 



Now, the product of indicator functions l[p,«]l[p,«;] is the indicator function of 
the set [p,v] f~l [p, it;]. This set is an arc of the form [p,v A w], where v A it; is 
the "most recent common ancestor" of v and w relative to the root p. Hence, 
f T l\p, v ]{u)l[p jW ](u)X(du) = \{[p,vAw}) isd^p^vAw) = \ [d(p,v) +d(p,w) - d(v,w)]. 
Therefore, 



z!(p,q) = 



1 



2 / / d(v,w)P(dv)Q(dw) 

t Jt 

d(v,w)P{dv)P{dw)- 

T JT 



d(v, w) Q(dv) Q(dw) 



T JT 



Thus, if X' , X" , Y', Y" are independent T-valued random variables, where X' , X" 
both have distribution P and Y', Y" both have distribution Q, then 

(6) zl(P, Q) = 1 (E[d(JC, y') - d(jf' , x")] + E[d(x', f') - d(r', r")]) ■ 

Analogous to weighted UniFrac, one can "normalize" the KR distance by dividing 
it by a scalar. The most direct analog of the scaling factor D used for weighted 
UniFrac on a rooted tree (|3| would be twice the KR distance between (P + Q)/2 
and a point mass located at the root. This is an upper bound by the triangle 
inequality. A root-invariant version would be to instead place the point mass at 
the center of mass (that is, the barycenter, see Section [5^2| of (P + Q)/2, and twice 
the analogous distance is again an upper bound by the triangle inequality. It is 
clear from the original definition of the KR distance Q, that Z\(P, Q) is bounded 
above by the diameter of the tree (that is, max Ii5 d(x, y)) or by the possibly smaller 
similar quantity that arises by restricting x and y to the respective supports of P 
and Q. Any of these upper bounds can be used as a "normalization factor". 

The goal of introducing such normalizations would be to permit better compar- 
isons between distances obtained for different pairs of samples. However, some care 
needs to be exercised here: it is not clear how to scale distances for pairs on two very 
different reference trees so that similar values of the scaled distances convey any 
readily interpretable indication of the extent to which the elements of the two pairs 
differ from each other in a "similar" way. In short, when comparing results between 
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trees, the KR distance and its generalizations are more useful as test statistics than 
as descriptive summary statistics. 

3. Assessing significance 

To assess the significance of the observed distance between the probability dis- 
tributions associated with a pair of samples of placed reads of size m and n, we use 
the permutation strategy mentioned in the Introduction for assigning significance 
to observed UniFrac distances. In general, we have a pair of probability distri- 
butions representing the pair of samples that is of the form P = ^Y^iLi^i an d 
Q = — Y^j=m+i n j> where TTf- is a probability distribution on the reference tree T 
representing the placement of the fc th read in a pooling of the two samples (in the 
point placement case, each iTk is just a unit point mass at some point Wk G T). 
We imagine creating all ( m+n ) pairs of "samples" that arise from placing m of the 
reads from the pool into one sample and the remaining n into the other, computing 
the distances between the two probability distributions on the reference tree that 
result from the placed reads, and determining what proportion of these distances 
exceed the distance observed in the data. This proportion may be thought of as a 
p-value for a test of the null hypothesis of no clustering against an alternative of 
some degree of clustering. 

Of course, for most values of m and n it is infeasible to actually perform this 
exhaustive listing of distances. We observe that if I C {1, . . . , m + n} is a uniformly 
distributed random subset with cardinality m (that is, all ( m r ^™) values are equally 
likely), J :— I c is the complement of /, P is the random probability distribution 
m TliPi ^ii ano - Q i s t ne random probability distribution i X^gj Tj'i then the pro- 
portion of interest is simply the probability that the distance between P and Q 
exceeds the distance between P and Q. We can approximate this probability in 
the obvious way by taking independent replicates of (I, J) and hence of (P, Q) and 
looking at the proportion of them that result in distances greater than the observed 
one. We illustrate this Monte Carlo approximation procedure in Section [4] 

3.1. Gaussian approximation. Although the above Monte Carlo approach to 
approximating a p-value is conceptually straightforward, it is tempting to explore 
whether there are further approximations to the outcome of this procedure that 
give satisfactory results but require less computation. 

Recall that -n\ , . . . , 7r m+n is the pooled collection of placed reads and that P = 
— X^ G j 7Tj and Q = — ^2 ie j^j, where / is a uniformly distributed random subset 
of {1, . . . , m + n} and J is its complement. Write 

Gk(u) '■= 7Tfe(r(u)) for any ueT, 1 < k < m + n, 

where we recall that t(u) is the tree below u relative to the root p. Define a 
T-indexed stochastic process X = (X(u)) u qt by 

X(u) := P(t(u)) - Q(r(«)) 

iei jeJ 

Then, 

" r 1 p a1 

Z p (P,Q)= / \X(u)\ p X(du) 
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If Hk, 1 < k < to + n, is the indicator random variable for the event {k e /}, 
then 

va+n 



X(u) = Y, 



k=i 



to n J n 



G k (u). 



Writing E, V, and C for expectation, variance, and covariance, we have 

m 



E[H t 
V[Hi] = 



m + n 
to n 



and 

It follows that 

and 
C[X(u),X(v)} 



ClHijHj 



- \J2G l (u)G t (v) 
in \ z — ' 

V • 

Y / G l (u)G l (v) 



1 

mn 

1 
mn 

1 
mn 



m + n m + n 

1 m n 

m + n — lm + nm + n' 

E[X(u)} = 

1 y^Gi(u)Gj(v) 

n + n-1 ^ v ' n J 

1 J2 G M G & 



i ^ i- 



m + n 



'■] 



J2G t (u)G t (v) - (m + n) ( ^—J2 G *( u n \ ~Ar £ G ^ 
*—' \ to + n z -^ I Xm + n*— 1 



£ (Gi(u) - G(u)) (Gi(t/) - G(v)) 



=: T(u,v) 

when to + n is large, where G(u) := ^^ J^k Gk(u). 

Remark 3.1. In the case of point placements, with the probability distribution TTk 
being the point mass at Wk € T for 1 < k < m + n, then 



T(u,v) 



1 

mn 



^2 #{fc : u € [p, Wk], V € [p, w k }} 

. k 

1 



m + n 



#{fc : u e [p, Wfe] }#{fc : w e [p,w k ]} 



By a standard central limit theorem for exchangeable random variables (see, 
e.g., Theorem 16.23 of Kallcnbcrg, 2001), the process X is approximately Gaussian 
with covariance kernel T when to + n is large. A straightforward calculation shows 
that we may construct a Gaussian process £ with covariance kernel T by taking 
independent standard Gaussian random variables 771 , ... , r) m+n and setting 



£(«) 



TOn ^-^ to + n \ ^-^ / \ *—' J 
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It follows that the distribution of Z p (P, Q) is approximately that of the random 
variable 



(7) 



1 iAl 



\au)\ p x(du) 

T 



One can repeatedly sample the normal random variates r\i and numerically integrate 
Q to approximate the distribution of this integral. In the example application of 
Section pi this provides a reasonable though not perfect approximation (Figure [3]). 
There is an even simpler approach for the case p = 2. Let n\, k — 1,2,..., 
and ipki k — 1,2,..., be the positive eigenvalues and corresponding normalized 
eigenfunctions of the non-negative definite, self-adjoint, compact operator on L 2 (X) 
that maps the function / to the function J„ T(-, v)f(v) X(dv). The functions fik^k, 
k = 1,2,..., form an orthonormal basis for the reproducing kernel Hilbert space 
associated with F and the Gaussian process £ has the Karhunen-Loeve expansion 

£(«) = ^2^kipk(u)Vk, 

k 

where r)k, k = 1, 2, . . ., are independent standard Gaussian random variables - see 



(Jain and Marcus 19781 for a review of the theory of reproducing kernel Hilbert 



spaces and the Karhunen-Loeve expansion. 
Therefore, 



[ \t(u)\ 2 \(du)=J2 



2 2 
HkVki 



and the distribution of Z%(P, Q) is approximately that of a certain positive linear 
combination of independent \i random variables. 

The eigenvalues of the operator associated with T can be found by calculating the 
eigenvalues of a related matrix as follows. Define an (m + n) x (771 + 71) non- negative 
definite, self- adjoint matrix M given by 

My ■=— f (Gi(u) - G(u)) (Gjiu) ~ G(u)) X(du). 

Note that if we have point placements at locations Wk G T for 1 < k < m + n as 
in Remark |3.1 1 then 

M = — (l- — ^ — ll^^j n(i- -J— 11 



777,77 \ 777 + 77 / \ 777 + 77 

where / is the identity matrix, 1 is the vector which has 1 for every entry, and the 
matrix N has (i, j) entry given by the distance from the root to the "most recent 
common ancestor" of Wi and Wj . 

Suppose that x is an eigenvector of M for the positive eigenvalue v 2 . Set 

(8) i>{u):= J2 (Gj(u) -G(u)) Xj . 
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Observe that 

/ T(u,v)tp{v) \{dv) 

JT 

= ^f £(<?,(«) -G(u))(G«(t;)-G(i;)) 

L i 



J2(G,(v)-G(v))*iHdv) 



V 2 Xi 



= Y,{G i {u)-G{u)) 

i 

= v 2 ij){u) : 

and so %p is an (unnormalizcd) cigcnfunction of the operator on L 2 (X) defined by 
the covariance kernel V with eigenvalue v 2 . 

Conversely, suppose that /i 2 is an eigenvalue of the operator with eigenfunction 
6. Set 



x j ■= / (Gj(v)-G{v))(f>(v)X(dv). 



Then, 



E M *: 



E — / (Gi(«) - G(u)) (G,-(u) - G(u)) A(d«) 



mn J T 



x / (Gj(v)-G(v)) <f>(v) \(dv) 

JT 

= [ (Gi(«)-G(«)) 



-J- E (G,N - G(«)) (G» - G(v)) 0(«) A(dv) 



A(du) 



(Gi(«)-G(«)) 



r(u, v)<p(v) X(dv) 



X{du) 



(Gi{u) - G{u)) ^ 2 <j){u)X{du) 



— /X Xi 1 

so that ^ 2 is an eigenvalue of M with (unnormalized) eigenvector of x. 

It follows that the positive eigenvalues of the operator associated with T coincide 
with those of the matrix M and have the same multiplicities. 

However, we don't actually need to compute the eigenvalues of M to implement 
this approximation. Because M is orthogonally equivalent to a diagonal matrix with 
the eigenvalues of M on the diagonal, we have from the invariance under orthogonal 
transformations of the distribution of the random vector r\ := (771, . . . , r/ m+n ) T that 
Sfe ^k^k nas the same distribution as rj Mr). Thus, the distribution of the random 
variable Z 2 (P, Q) is approximately that of yv ; Mijijirjj. 
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One might hope to go even further in the p — 2 case and obtain an analytic 
approximation for the distribution ^ fc [i\rff. or a useful upper bound for its right 
tail. It is shown in (Hwang 1980) that if we order the positive eigenvalues so that 
Mi > A*i ^ • ■ • an d assume that \x\ > /x|, then 




V^ 2 2 \ 



Mi 



n 






exp 



2M? 



in the sense that the ratio of the two sides converges to one as r — > oo. It is not clear 
what the rate of convergence is in this result and it appears to require a detailed 
knowledge of the spectrum of the matrix M to apply it. 

Gaussian concentration inequalities such as Borell's inequality (see, for example, 
Section 4.3 of (Bo gachev] |1998 )) give bounds on the right tail that only require a 



knowledge of E[(^ fc Mfc??!) 3 ] anc ^ Mi> but these bounds are far too conservative for 
the example in Section [4] 

There is a substantial literature on various series expansions of densities of posi- 
tive linear combinations of independent Xi random variables. Some representative 



papers are (Robbins and Pitman 1949 Gurland 1955 Pachares 



1962 Kotz et al. 1967 Gideon and Gurland 19761. However, it seems that ap 



1955 Ruben 



plying such results would also require a detailed knowledge of the spectrum of the 
matrix M as well as a certain amount of additional computation to obtain the 
coefficients in the expansion and then to integrate the resulting densities, and this 
may not be warranted given the relative ease with which it is possible to repeatedly 
simulate the random variable r) Mr). 

Even though these more sophisticated ways of using the Gaussian approximation 
may not provide tight bounds, the process of repeatedly sampling normal random 
variates r\i and numerically integrating the resulting Gaussian approximation Q 
does provide a useful way of approximating the distribution obtained by shuffling. 
This approximation is significantly faster to compute for larger collections of place- 
ments. For example, we considered a reference tree with 652 leaves and 5 samples 
with sizes varying from 3372 to 15633 placements. For each of the 10 pairs of sam- 
ples, we approximated the distribution of the Z\ distance under the null hypothesis 
of no difference by both creating "pseudo-samples" via random assignment of reads 
to each member of the pair ( "shuffling" ) and by simulating the Gaussian process 
functional with a distribution that approximates that of the Z\ distance between 
two such random pseudo-samples. We used 1000 Monte Carlo steps for both ap- 
proaches. The (shuffle, Gaussian) run-times in seconds ranged from (494.1, 36.8) 
to (36.1, 2.2); in general, the Gaussian procedure ran an order of magnitude faster 
than the shuffle procedure. 



3.2. Interpretation of p- values. Although the above-described permutation pro- 
cedure is commonly used to assess the statistical significance of an observed dis- 
tance, we discuss in this section how its interpretation is not completely straight- 
forward. 

In terms of the classical Neyman-Pearson framework for hypothesis testing, we 
are computing a p- value for the null hypothesis that an observed subdivision of a set 
of to + n objects into two groups of size to and n looks like a uniformly distributed 
random subdivision against the complementary alternative hypothesis. For many 
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purposes, this turns out to be a reasonable proxy for the imperfectly-defined notion 
that the two groups are "the same" rather than "different." 

However, a rejection of the null hypothesis may not have the interpretation that 
is often sought in the microbial context - namely, that the two collections of reads 
represent communities that are different in biologically relevant ways. For example, 
assume that m = n = NK for integers N and K. Suppose that the placements 
in each sample are obtained by independently laying down N points uniformly 
(that is, according to the normalized version of the measure A) and then putting 
K placements at each of those points. The stochastic mechanism generating the 
two samples is identical and they are certainly not different in any interesting way, 
but if K is large relative to N the resulting collections of placements will exhibit a 
substantial "clustering" that will be less pronounced in the random pseudo-samples, 
and the randomization procedure will tend to produce a "significant" p-value for 
the observed KR distance if the clustering is not taken into account. 

These considerations motivate consideration of randomization tests performed 
on data which is "clustered" on an organismal level. Clustering reads by organism 



is a difficult task and an active research topic (White et al. 2010 1. A thoroughgoing 



exploration of the effect of different clustering techniques is beyond the scope of this 
paper, but we examine the impact of some simple approaches in the next section. 



4. Example application 
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Figure 1. Tree with branches thickened as a linear function of 
the number of placements in the control sample placed on that 
branch. 



To demonstrate the use of the Z p metric in an example application, we inves- 
tigated variation in expression levels for the psbA gene for an experiment in the 



Sargasso Sea (Vila-Costa et al.l 2010). Metatranscriptomic data was downloaded 



from the CAMERA website (http://camera.calit2.net/), and a psbA alignment was 
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FIGURE 2. Tree as in Figure [T] but for the DMSP-treated sample. 



supplied by Robin Kodner. Searching and alignment was performed using HMMER 
(Eddy 1998), a reference tree was inferred using RAxML (Stamatakis. 2006), and 
phylogenetic placement was performed using pplacer ( Matsen et al.| |2010[ ). The 



calculations presented here were performed using the "guppy" binary available as 
part of the pplacer suite of programs (http://matsen.fhcrc.org/pplacer). 

Visual inspection of the trees fattened by number of placements showed the same 
overall pattern with some minor differences (Figure [l] and f2J) . However, application 
of the KR metric revealed a significant difference between the two samples. The 
value of Z\ for this example (using spread placements and normalizing by total 
tree length) was 0.006601. This is far out on the tail of the distribution (Figure pBl, 
and is in fact larger than any of the 1000 replicates generated via shuffling or the 
Gaussian-based approximation. 

Such a low p-value prompts the question of whether the center of mass of the 
two distributions is radically different in the two samples (see Section [572] ) . In this 
case, the answer is no, as the two barycenters are quite close together (Figure El 
see Section 5.2). 



It was not intuitively obvious to us how varying p would affect the distribution 
of the Z p distance under the null hypothesis of no clustering. To investigate this 
question, we plotted the observed distance along with boxplots of the null distribu- 
tion for a collection of different p (Figure [5]). It is apparent that there is a consistent 
conclusion over a wide range of values of p. 

One can also visualize the difference between the two samples by drawing the 
reference tree with branch thicknesses that represent the minimal amount of "mass" 
that flows through that branch in the optimal transport of mass implicit in the 
computation of Zi(P,Q) and with branch shadings that indicate the sign of the 
movement (Figure p3j). 

Next we illustrate the impact of simple clustering on randomization tests for 
KR. The clustering for these tests will be done by rounding placement locations 
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Figure 3. Comparison of the distribution of Z\ and Z^ distances 
obtained by shuffling, Gaussian approximation, and the observed 
value (marked with x) for the example data set. 
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Figure 4. Dendrogram with barycenters marked. Circle is the 
control sample, and star is the sample treated with DMSP. 
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Figure 5. Plot showing sample (point) and randomized ranges 
(box-and- whisker) . Outliers eliminated for clarity. For each p, the 
distribution was rescaled by subtracting the mean and dividing by 
the standard deviation. 



using two parameters: the mass cutoff C and the number of significant figures S as 
follows. Placement locations with low probability mass for a given read are likely 



to be error-prone (Matsen et al. 2010), thus the first step is to through away those 



locations associated with posterior probability or "likelihood weight ratio" below C. 
The second step is to round the placement attachment location and pendant branch 
length by multiplying them by 10 and rounding to the nearest integer. The reads 
whose placements are identical after this rounding process are then said to cluster 
together. We will call the number of reads in a given cluster the "multiplicity" of 
the cluster. 

After clustering, various choices can be made about how to scale the mass dis- 
tribution according to multiplicity. Again, each cluster has some multiplicity and 
a distribution of mass across the tree according to likelihood weight. One option 
(which we call straight multiplicity) is to multiply the mass distribution by the mul- 
tiplicity. Alternatively, one might forget about multiplicity by distributing a unit 
of mass for each cluster irrespective of multiplicity. Or one might do something 
intermediate by multiplying by a transformed version of multiplicity; in this case 
we transform by the hyperbolic arcsine, asinh. 

We calculated distances and p- values for several clustering parameters and mul- 
tiplicity uses (Table [T]). To randomize a clustered collection of reads, we reshuffled 
the labels on the clusters, maintaining the groupings of the reads within the clusters; 
thus, all the placements in a given cluster were assigned to the same pseudo-sample. 
The distances do not change very much under different collections of clustering pa- 
rameters, as there is little redistribution of mass. On the other hand, the p- values 
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Figure 6. A tree displaying the optimal movement of mass for the 
KR metric. When moving from the first probability distribution 
to the second, branches marked in gray have mass moving towards 
the root, while those marked in black have mass moving towards 
the leaves. Thickness shows the quantity of mass moving through 
that branch. 



S C 



strict Z\ strict p asinh Z\ asinh p unit Z\ unit p 



1 


0.01 


0.006578 


0.0087 


0.007016 


0.0008 


0.007054 


0.0003 


1 


0.05 


0.006584 


0.0218 


0.006986 


0.0018 


0.007036 


0.0005 


1 


0.1 


0.006562 


0.035 


0.007214 


0.001 


0.007322 


0.0005 


2 


0.01 


0.006601 


0.0018 


0.007076 


0.0003 


0.007281 


0.0001 


2 


0.05 


0.006587 


0.0029 


0.00696 


0.0005 


0.007111 


0.0002 


2 


0.1 


0.006592 


0.0039 


0.007088 


0.0003 


0.007423 





3 


0.01 


0.006601 


0.0017 


0.006806 


0.0005 


0.006922 


0.0002 


3 


0.05 


0.006602 


0.0018 


0.006719 


0.0003 


0.006695 


0.0001 


3 


0.1 


0.006612 


0.0012 


0.006775 


0.0003 


0.006816 


0.0001 



Table 1. Distances (Z\) and significance levels (p) for various 
choices of clustering parameters and multiplicity interpretations 
described in the text for 10,000 randomizations. 



are different, because under our randomization strategy mass is relabeled on a 
cluster-by-cluster level. The different choices represented in this table represent 
different perspectives on what the multiplicities mean. The "strict" multiplicity- 
based p- value corresponds to interpreting the multiplicity with which reads appear 
as meaningful, the unit cluster p- value corresponds to interpreting the multiplicities 
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as noise, and the asinh-transformed multiplicity sits somewhere in between. The p- 
value with no clustering (as above, Z^ — 0.006601, with a p-value of 0) corresponds 
to interpreting reads as being sampled one at a time from a distribution. 

The choice of how to use multiplicity information depends on the biological set- 
ting. There is no doubt that increased organism abundance increases the likelihood 
of sampling a read from that organism, however the relationship is almost certainly 



nonlinear and dependent on species and experimental setup (Morgan et al. 2010) 



How multiplicities are interpreted and treated in a specific instance is thus a deci- 
sion that is best left to the researcher using his/her knowledge of the environment 
being studied and the details of the experimental procedure. 



5. Discussion 



5.1. Other approaches. 



5.1.1. Operational Taxonomic Units (OTUs). The methods described in this pa- 
per are complementary to comparative methods based on "operational taxonomic 
units" (OTUs). OTUs are groups of reads which are assumed to represent the reads 
from a single species, and are typically heuristically defined using a fixed percent- 
age sequence similarity cutoff. A comparative analysis then proceeds by comparing 
the frequency of various OTUs in the different samples. There has been some 
contention about whether OTU-based methods or phylogenetic based methods are 
superior- e.g. (Schloss 2008) and (Lozupone et al. 2010)- but most studies use a 



combination of both, and the major software packages implement both. A recent 
comparative study for distances on OTU abundances can be found in a paper by 



Kuczynski et al. (2010). 



5.1.2. Other phylogenetic approaches. There are a number of ways to compare mi- 
crobial samples in a phylogenetic context besides the method presented here. One 
popular means of comparing samples is the "parsimony test," by which the most 
parsimonious assignment of internal nodes of the phylogenetic tree to communities 
is found; the resulting parsimony score is interpreted as a measure of difference be- 



tween communities (Slatkin and Maddison 1989 Schloss and Handelsman 2006). 



Another interesting approach is to consider a "generalized principal components 
analysis" whereby the tree structure is incorporated into the process of finding prin- 



cipal components of the species abundances (Bik et al. 2006 Purdom 2008). The 



Kantorovich-Rubinstein metric complements these methods by providing a means 
of comparing samples that leverages established statistical methodology, that takes 
into account uncertainty in read location, and can be visualized directly on the tree. 
There are other metrics that could be used to compare probability distributions 
on a phylogenetic tree. The metric on probability distributions that is most familiar 
to statisticians other than the total variation distance is probably the Prohorov 
metric and so they may feel more comfortable using it rather than the KR metric. 
However, the Prohorov metric is defined in terms of an optimization that does not 
appear to have a closed form solution on a tree and, in any case, for a compact 
metric space there are results that bound the Prohorov metric above and below by 
functions of the KR metric (see Problem 3.11.2 of Ethier and Kurtz; 1986 ) so the 
two metrics incorporate very similar information about the differences between a 
pair of distributions. 
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5.2. The barycenter of a probability distribution on a phylogenetic tree. 

It can be useful to compare probability distributions on a metric space by calcu- 
lating a suitably denned "center of mass" that provides a single point summary for 
each distribution. Recall the standard fact that if P is a probability distribution 
on a Euclidean space such that J \y — x\ 2 P(dy) is finite for some (and hence all) x, 
then the function x i— > J \y — x\ 2 P(dy) has a unique minimum at x — J y P(dy). 
A probability distribution P on an arbitrary metric space (S, r) has a "center of 
mass" or barycenter at xq if j r(x,y) 2 P(dy) is finite for some (and hence all x) 
and the function x h-» J r(x,y) 2 P{dy) has a unique minimum at Xq. In terms of 
the concepts introduced above, the barycenter is the point x that minimizes the Z 2 
distance between the point mass S x and P. 

Barycenters need not exist for general metric spaces. However, it is well-known 
that barycenters do exist for probability distributions on Hadamard spaces. A 
Hadamard space is a simply connected complete metric space in which there is a 
suitable notion of the length of a path in the space, the distance between two points 
is the infimum of the lengths of the paths joining the points, and the space has non- 



positive curvature in an appropriate sense - see (Burago et al. 2001 ). Equivalcntly, 



a Hadamard space is a complete CAT(O) space in the sense of (Bridson and Hae- 



fliger 19991 



It is a straightforward exercise to check that a tree is a Hadamard space - see 
Example 11.1.15(4) of (Bridson and Hacfligcr, 1999) and note the remark after 



Definition II. 1.1 of (Bridson and Haefliger| |1999[) that a Hadamard space is the 



same thing as a complete CAT(O) space. Note that CAT(O) spaces have already 
made an appearance in phylogenetics in the description of spaces of phylogenetic 
trees ( |Billera et aLJ|2001| ). 



The existence of barycenters on the tree (T, d) may also be established directly as 
follows. As a continuous function on a compact metric space, the function / : T — > 
W. + defined by f(x) := J T d(x,y) 2 P{dy) achieves its infimum. Suppose that the 
infimum is achieved at two points x' and x". Define a function 7 : [0, d(x / , x")] — > 
[x',x"], where [x',x"] C T is the arc between x' and x" , by the requirement that 
7(i) is the unique point in [a/, 2"] that is distance t from x'. It is straightforward 
to check that the composition / o 7 is strongly convex; that is, 

(/ o 7) (ar + (1 - a)s) < a(f o 7)(r) + (1 - a)(/o 7 )( s ) 

for < a < 1 and r,s € [0,d(x',x")}. In particular, f('j(d(x',x")/2)) = (/ o 
j)(d(x',x")/2) < (f(x') + f(x"))/2, contradicting the definitions of x' and x" . 
Thus, a probability distribution on a tree possesses a barycenter in the above sense. 
We next consider how to compute the barycenter of a probability distribution 
P on the tree (T, d). For each point u £ T there is the associated set of directions 
in which it is possible to proceed when leaving u. There is one direction for every 
connected component of T \ {u}. Thus, there is just one direction associated with 
a leaf, two directions associated with a point in the interior of a branch, and k 
associated with a vertex of degree k. Given a point u and a direction S, write 
T(u, S) for the subset of T consisting of points v 7^ u such that the unique path 
connecting u and v departs u in the direction (5, set 

D(u, S):=- f d(u, y) P(dy) + / d(u, y) P(dy), 

JT{u,S) JT\T(u,5) 
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and note that 



lini 



d(u,v) 



d(v,y) J P(dy) - / d(u,y) 2 P(dy) 

T JT 



2D(u,S), 



where the limit is taken over v — > u, v G T(u,6). Note that if u is in the interior 
of a branch [a, b] and b is in the direction S from u, u is in the direction a from a, 
and u is in the direction /3 from b, then 

D(tt, 5) = - / d(u, y) P{dy) - f d(u, y) P(dy) 

JT\T(b,/3) J(u,b) 

d(u,y)P(dy)+ f d{u,y)P{dy) 

T\T(a,a) J (o,u) 

d(a,2/)P(dy) + d(a,n)P(T\T(6,/3)) 

T\T(fc,/3) 

d(a,y)P(dy) + d(a,u)P{(u,b)) 

(u,6) 

d(o, y) P(cfy) + d(o, u)P(T \ T(o, a)) 

T\T(o,a) 

+ d(a, u)P((o, «)) - f d(a, y) P(dy) 

J (a,u) 

= D(a, a) + d(a, u). 

If for some vertex u of the reference tree, D(u, 8) is greater than for all directions 
5 associated with u, then u is the barycenter (this case includes the trivial one in 
which u is a leaf and all the mass of P is concentrated on u). If there is no such 
vertex, then there must be a unique pair of neighboring vertices a and b such that 
D(a,a) < and D(b,/3) < 0, where a is the direction from a pointing towards b 
and (3 is the direction from b pointing towards a. In that case, the barycenter must 
lie on the branch between a and b, and it follows from the calculations above that 
the barycenter is the point u G (a, b) such that d(a,u) = —D(a,a). 

5.3. Z|(P, Q) and ANOVA. In this section we demonstrate how Z\{P^Q) can 
be interpreted as a difference between the pooled average of pairwise distances and 
the average for each sample individually. 

As above, let 7r l7 . . . , 7r m (resp. 7r m+1 , . . . , 7r m+ „) be the placements in the first 
(resp. second) sample, so that each -k^ is a probability distribution on the tree T, 
P=h ££1 Ti. ^d Q = i EKw *-,. Set 

m n „ 1 v—v 

P:=— — -P + — — Q = — ^: $>*• 

m + n m -\- n m-\- n z — ' 

k 

Recall the T-valued random variables X' ', X" ', F', F" that appeared in ([6]). If /', I" 
are {0, l}-valued random variables with P{P = 1} = P{7" = 1} = ^^ and 
X',X",y',Y'",J' ) J" are independent, then defining Z',Z" by Z' = X' on the 
event {I' = 1} (resp. Z" = X" on the event {/" = 1}) and Z' = Y' on the event 
{/' = 0} (resp. Z" = Y" on the event {/" = 0}) gives two T-valued random 
variables with common distribution R. 
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It follows readily from ffih that 



Z%(P,Q) 



1 (m + n) 2 

2 mn 

1 (m + n) 2 

2 TO7J 

m 



m + n 



E[d{Z',Z") 



m + n 
d(v,w)R(dv)R(dw) 



E[d(X',X") 



T JT 



d(v,w)P(dv)P(dw) + 



T JT 



m + n 



m + n 



E[d(Y',Y r 



d(v, w) Q(dv) Q{dw) 



T JT 



Thus, Z\(P,Q) gives an indication of the "variability" present in the pooled col- 
lection 7Tfe, 1 < k < m + n, that is over and above the "variability" in the two 
collections 71$, 1 < i < m, and nj, m + 1 < j < m + n. 



6. Conclusion 

As sequencing becomes faster and less expensive, it will become increasingly 
common to have a collection of large data sets for a given gene. Phylogenetic 
placement can furnish an evolutionary context for query sequences, resulting in each 
data set being represented as a probability distribution on a phylogenetic tree. The 
Kantorovich-Rubinstein metric is a natural means to compare those probability 
distributions. In this paper we showed that the weighted UniFrac metric is the 
phylogenetic Kantorovich-Rubinstein metric for point placements. We explored 
Zolotarev-type generalizations of the KR metric, showed how to approximate the 
limiting distribution and made connections with the analysis of variance. 

The phylogenetic KR metric and its generalizations can be used any time one 
wants to compare two probability distributions on a tree. However, our software im- 
plementation is designed with metagenomic and metatranscriptomic investigations 
in mind; for this reason it is tightly integrated with the phylogenetic placement 
software pplacer (Matsen et al. 2010). With more than two samples, principal 



components analysis and hierarchical clustering can be applied to the pairwise dis- 
tances to cluster environments based on the KR distances as has been done with 



UniFrac ( Lozupone and Knight 2005 Lozupone et al. 2008 Hamady et al. 2009). 



We have recently developed versions of these techniques which leverage the special 



structure of this data (Matsen and Evans 2011 ). 



Another potential future extension not explored here is to partition the tree into 
subsets in a principal components fashion for a single data set. Recall that ^ gives 
a formula for the eigenf unctions of the covariance kernel T given the eigenvectors 
of M. For any k, one could partition the tree into subsets based on the sign of 
the product of the first k eigenfunctions, which would be analogous to partition- 
ing Euclidean space by the hyperplanes associated with the first k eigenvectors in 
traditional principal components analysis. 

Future methods will also need to take details of the DNA extraction procedure 
into account. Recent work shows that current lab methodology is unable to recover 
absolute mixture proportions due to differential ease of DNA extraction between 
organisms (Morgan et al. 20101. However, relative abundance between samples 



for a given organism with a fixed laboratory protocol potentially can be measured, 
assuming consistent DNA extraction protocols are used. An important next step 
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is to incorporate such organism-specific biases into the sort of analysis described 
here. 
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